Community structure plays a significant role in the analysis of social networks and similar graphs,
yet this structure is little understood and not well captured by most models. We formally define
a community to be a subgraph that is internally highly connected and has no deeper substructure.
We use tools of combinatorics to show that any such community must contain a dense Erdős-Rényi
(ER) subgraph. Based on mathematical arguments, we hypothesize that any graph with a heavy-tailed
degree distribution and community structure must contain a scale-free collection of dense
ER subgraphs. These theoretical observations are well corroborated by empirical evidence. From this,
we propose the Block Two-Level Erdős-Rényi (BTER) model, and demonstrate that it accurately
captures the observable properties of many real-world social networks.



INTRODUCTION 

Graph analysis is becoming increasingly prevalent in 
the quest to understand diverse phenomena like social 
relationships, scientific collaboration, purchasing behav- 
ior, computer network traffic, and more. We refer to 
graphs coming from such scenarios collectively as inter- 
action networks. A significant amount of investigation 
has been done to understand the graph-theoretic proper- 
ties common to interaction networks. Of particular im- 
portance is the notion of community structure. Interac- 
tion networks typically decompose into internally well- 
connected sets referred to as low conductance or high 
modularity cuts [1, 2]. Moreover, many graphs have high 
clustering coefficients [3], which is indicative of underly- 
ing community structure. Communities occur in a vari- 
ety of sizes, though the largest community is often much 
smaller than the graph itself [4, 5]. Community analysis 
can reveal important patterns, decomposing large collec- 
tions of interactions into more meaningful components. 



A Theory of Communities 

One metric of the quality of a community is the modularity metric [2]. There are other measures
such as conductance [6], but they are equivalent to modularity for our purposes. Consider an
undirected graph $G$ with $n$ vertices and degrees $d_1, d_2, \ldots, d_n$. Let
$m = \frac{1}{2}\sum_{i=1}^{n} d_i$ denote the number of edges. We say a subgraph $S$ has high
modularity if $S$ contains many more internal edges than predicted by a null model, which says
vertices $i$ and $j$ are connected with probability $d_i d_j / 2m$. (Technically, the probability
is $\min\{1, d_i d_j / 2m\}$, but we keep the notation simple for clarity.) We refer to the null
model as the CL model, based on its formalization by Chung and Lu [7, 8]; see also Aiello et al. [9]
and the edge-configuration model of Newman et al. [10].
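
The CL null model just described can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names `cl_edge_probability` and `sample_cl_graph` are hypothetical and do not come from any reference implementation, and the quadratic loop over vertex pairs is chosen for clarity rather than efficiency.

```python
import random


def cl_edge_probability(d_i, d_j, m):
    """Probability that two vertices with degrees d_i and d_j are joined,
    capped at 1 as in the technical form of the model."""
    return min(1.0, d_i * d_j / (2.0 * m))


def sample_cl_graph(degrees, seed=0):
    """Sample an undirected CL graph by flipping one coin per vertex pair.

    `degrees` is the list of expected degrees; the result is a set of
    edges (i, j) with i < j.
    """
    rng = random.Random(seed)
    m = sum(degrees) / 2.0  # expected number of edges
    n = len(degrees)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < cl_edge_probability(degrees[i], degrees[j], m):
                edges.add((i, j))
    return edges
```

For example, `sample_cl_graph([3] * 20)` connects each pair with probability $9/60 = 0.15$, so every vertex has expected degree close to 3.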

Given a high modularity subgraph $S$, we say it is a module if it does not contain any further
substructures of interest; in other words, it is internally well-modeled by CL. Formally, assume
$S$ has $r$ nodes with internal degrees $d_1, d_2, \ldots, d_r$ and let the number of edges in $S$
be denoted by $s = \frac{1}{2}\sum_{i=1}^{r} d_i$. Consider the CL model on $S$ where edge $(i,j)$
occurs with probability $d_i d_j / 2s$. We call $S$ a module if the induced subgraph on $S$ (the
subgraph internal to $S$) is modeled well by this CL model. Looking
at the contrapositive, if S is not a module, then S itself 
contains a subset of vertices that should be separated 
out. A module can be thought of as an "atomic" sub- 
structure within a graph. In this language, we can think 
of community detection algorithms as breaking a graph 
into modules. This discussion is not complete, however, 
since communities are not just modules, but also inter- 
nally well-connected. 

Interaction networks have an abundance of triangles, 
a fact that Watts and Strogatz [3] succinctly express 
through clustering coefficients. Barrat and Weigt [11] 
defined this as 

$$C = \frac{3 \times \text{total number of triangles}}{\text{total number of wedges}}, \tag{1}$$

where a wedge is a path of length 2 [1, 3]. It has been 
observed that C "has typical values in the range of 0.1 
to 0.5 in many real- world networks" [1]. Moreover, our 
own studies have revealed that the node-level clustering 
coefficient (first used in [3]), defined by 

$$C_i = \frac{\text{number of triangles incident to node } i}{\text{number of wedges centered at node } i}, \tag{2}$$

is typically highest for small degree nodes. Large clus- 
tering coefficients are considered a manifestation of the 
community structures. Naturally, we expect the trian- 
gles to be largely contained within the communities due 
to their high internal connectivity. 
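
Both clustering coefficients can be computed directly from an adjacency structure. The following Python sketch is a minimal illustration, assuming the graph is given as a dictionary mapping each vertex to the set of its neighbors; the function name `clustering_coefficients` is hypothetical. It relies on the fact that every triangle closes exactly three wedges, one centered at each of its vertices.

```python
from itertools import combinations


def clustering_coefficients(adj):
    """Return (C, local) for an undirected graph.

    adj   : dict mapping vertex -> set of neighbors (symmetric, no loops)
    C     : 3 * triangles / wedges over the whole graph
    local : dict mapping vertex i -> triangles at i / wedges centered at i
    """
    total_wedges = 0
    total_closed = 0  # closed wedges; equals 3 * number of triangles
    local = {}
    for v, nbrs in adj.items():
        wedges = len(nbrs) * (len(nbrs) - 1) // 2
        # A wedge a - v - b is closed when a and b are themselves adjacent.
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        local[v] = closed / wedges if wedges else 0.0
        total_wedges += wedges
        total_closed += closed
    C = total_closed / total_wedges if total_wedges else 0.0
    return C, local
```

On a triangle with one pendant vertex attached, this gives $C = 3/5$: there is one triangle and five wedges.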

We now formally define a community to be a module with a large internal clustering coefficient.
More formally, we say a module is a community if the expected number of triangles is more than
$(\kappa/3)\sum_i \binom{d_i}{2}$, for some constant $\kappa$. In other words, a community is
tightly connected internally and has few external links. A graph has
community structure if it (or at least a constant fraction 
of it) can be broken up into communities. The benefit 



of this formalism is that we can now try to understand 
what graphs with community structure look like. 

Let us first begin by just focusing on a single community. It seems fairly intuitive that a
community cannot be large while comprising only low degree vertices, nor can it consist of a single
high-degree node connected to degree-one vertices (a star). We can actually prove a structural
theorem about a community, given our formalization. Recall that an Erdős-Rényi (ER) graph [12, 13]
on $n$ vertices with connection probability $p$ is a graph such that each pair of vertices is
independently connected with probability $p$. If $p$ is a constant, we call this a dense ER graph;
if $p = O(1/n)$, then we call this a sparse ER graph. Using triangle bounds from extremal
combinatorics and some probabilistic arguments, we can prove the following theorem.

Theorem 1. A constant fraction of the edges in a community are contained in a dense Erdős-Rényi
graph. More formally, if the community has $s$ edges, then there must be $\Omega(\sqrt{s})$
vertices with degree $\Omega(\sqrt{s})$.

This theorem is interesting because even though it is 
well known that ER graphs are not good models for in- 
teraction networks, they nonetheless form an important 
building block for the communities. We interpret this 
theorem as saying that the simplest possible community 
is just a dense ER graph. Building on this simple intu- 
ition, we think of an interaction network as consisting of 
a large collection of dense ER graphs. 

This leads naturally to a question about the distribu- 
tion of sizes of these ER components. For that, consider 
the power law degree distribution observed by Barabasi 
and Albert [14] and others. They show that interaction 
graphs exhibit heavy-tailed degree distributions such as 

$$X_d \propto d^{-\gamma}, \tag{3}$$

where $X_d$ is the number of nodes of degree $d$ and $\gamma$ is the
power law exponent.

Suppose we packed nodes with a heavy-tailed degree distribution into a collection of dense ER
graphs. A community with $s$ edges would be a dense ER graph of $\sqrt{s}$ vertices of degree
$\sqrt{s}$, so the size (in vertices) of the community is exactly $\sqrt{s}$. Setting
$d = \sqrt{s}$, the number of such communities is proportional to

$$\frac{n / d^{\gamma}}{d} = \frac{n}{d^{\gamma+1}}.$$

This forms a scale-free distribution of communities, exactly as observed by many studies on
community structure [4, 5]. Hence, we hypothesize that real-world interaction networks consist of a
scale-free collection of dense Erdős-Rényi graphs. This is consistent with most of the important
observed properties of these networks.

Our analysis immediately leads to connections with Dunbar's celebrated result on "mean group sizes"
of humans (reported to be around 148 with 95% confidence limits of 100-231). Empirically, this has
been reported by a variety of studies [4, 5, 15, 16]. If there exists a community of size $d$, it
must satisfy $n/d^{\gamma+1} \ge 1$. For the maximum community size $\bar{d}$, we have
$\bar{d} \sim n^{1/(\gamma+1)}$. For $n$ being a million and $\gamma = 2$, we get an estimate
for $\bar{d}$ of 100, surprisingly close to Dunbar's estimate.
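
The arithmetic behind this estimate is easy to reproduce. The Python sketch below is a minimal illustration, with the proportionality constant in the community count assumed to be 1; the function names are hypothetical.

```python
def community_count(n, d, gamma):
    """Number of communities of size d, up to a constant factor:
    n / d**gamma nodes of degree d, packed d to a community."""
    return n / d ** (gamma + 1)


def max_community_size(n, gamma):
    """Largest d for which at least one community of size d is expected,
    i.e. the solution of n / d**(gamma + 1) = 1."""
    return n ** (1.0 / (gamma + 1))
```

With `n = 10**6` and `gamma = 2`, `max_community_size` returns a value of about 100, matching the estimate in the text; larger exponents give smaller maximum communities for the same `n`.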

As an aside, Thm. 1 also proves that CL by itself is not a good model for interaction networks.
Suppose the entire graph $G$ (with $m$ edges) can be modeled as a CL graph. Since $G$ also has a
high clustering coefficient, $G$ itself is a community. Hence, $G$ must have $\Omega(\sqrt{m})$
vertices with degree $\Omega(\sqrt{m})$, but this violates the tail behavior of the degree
distribution.



The BTER model 

Based on the idea of a graph comprising ER communities, we propose the Block Two-Level Erdős-Rényi
model (BTER). The advantages of the BTER model are that it has community structure in the form of
dense ER subgraphs and that it matches well with real-world graphs. We briefly describe the model
here and provide a more detailed explanation and comparisons to real-world graphs in subsequent
sections.

The first phase (or level) of BTER builds a collection of ER blocks in such a way that the
specified degree distribution is respected. The BTER model allows one to construct a graph with any
degree distribution. Real-world degree distributions might be idealized as power laws, but this is
by no means a completely accurate description [17, 18]. When the degree distribution is heavy
tailed, the BTER graph naturally has scale-free ER subgraphs. The internal connectivity of the ER
graphs is specified by the user and can be tuned to match observed data.

The second phase of BTER interconnects the blocks. We assume that each node has some excess degree
after the first phase. For example, if vertex $i$ should have $d_i$ incident edges (according to
the input degree distribution), and it has $d'_i$ edges from its ER block, then the excess degree
is $d_i - d'_i$. We use a CL model (which can be considered as a weighted form of ER) over the
excess degrees to form the edges that connect communities.



Previous models 

There are many existing models for social networks and other real-world graphs. We give a short
description of some important models; for more details, we recommend the survey of Chakrabarti and
Faloutsos [19]. Classic examples include preferential attachment [14], small-world models [3],
copying models [20], and forest fire [21]. Although these models may produce heavy-tailed degree
distributions, the clustering coefficients of the former three models are often low [22]. Even for
models that give high clustering coefficients, it is difficult to predict their community structure
in advance. Because of this unpredictable behavior, it is not possible to match real data with
these graphs. This makes it difficult to validate them against real-world interaction networks.
Moreover, none of these models explain community structure, one of the most striking features of
interaction graphs.

A widely used model is the Stochastic Kronecker Graph (SKG) model (known as R-MAT in an early
incarnation) [23, 24]. Notably, it has been selected as the generator for the Graph 500
Supercomputer Benchmark [25]. Though it has some desirable properties [24], it can only generate
lognormal tails (after suitable addition of random noise [26]) and does not produce high clustering
coefficients [22, 27]. Multifractal networks are closely related to the SKG model [28]. The random
dot product model [29, 30] can be made scalable but has never been compared to real social
networks. There have been successful dendrogram-based structures that perform community detection
and link prediction in real graphs [31, 32]. The recent hyperbolic graph model [33, 34] is based on
hyperbolic geometry and has been used to perform Internet routing.

The stochastic block model [35] has been used to gener- 
ate better algorithms for community detection. A degree 
corrected version [36] has been defined to deal with im- 
precisions in this model. A key feature of these models is 
that they break the graph into a constant number of rela- 
tively large blocks, and our theory shows that this model 
does not give a satisfactory explanation of the clustering 
coefficients of low degrees (which constitute a majority of 
the graph). The LFR community detection benchmark 
[37] is also somewhat connected to this model, since it de- 
fines a set of communities and has probabilities of edges 
within and between these communities. We stress that
these models do not attempt to match real graphs, nor do
they explain the scale-free nature of communities [4, 5].
Our hypothesis and model are very different from these
results, because we use a mathematical formalization to
prove the existence of a scale-free dense ER collection,
and the BTER model follows this theory. Nonetheless,
our model can be seen as an extension of these block models,
where the number and sizes of blocks exhibit scale-free
behavior. Implicitly, our model can be seen to use a
labeling scheme for vertices that depends on the degrees,
and connecting vertices with probabilities depending on 
the labels (thereby related to the degree corrected frame- 
work of [36]). 



MATHEMATICAL DETAILS 

We provide a sketch of the proof for Thm. 1; a complete proof is provided in the supplement. Our
analysis is fundamentally asymptotic, so for ease of notation we use the $O(\cdot)$,
$\Omega(\cdot)$, and $\Theta(\cdot)$ notation to suppress constant factors. The notation $A \ll B$
indicates that there exists some absolute constant $c$ such that $A \le cB$. We let $S$ denote the
community of interest and assume that the internal degree distribution of the community $S$ is
$d_1, d_2, \ldots, d_r$. We denote the number of edges in $S$ by
$s = \frac{1}{2}\sum_{i=1}^{r} d_i$.

Based on the given distribution, let $T$ denote the expected number of triangles in $S$. Since this
is a community, we demand that $T$ be at least $\kappa/3$ times the expected number of wedges, for
some constant $\kappa$. This means that

$$T \ge (\kappa/3)\sum_i \binom{d_i}{2}, \tag{4}$$

where, for convenience, we define $\binom{d}{2} = 0$ when $d = 1$. Let $j$ be the first index such
that $d_j > 1$. We assume that $\sum_{i \ge j} d_i^2 = \Omega\left(\sum_{i<j} d_i\right)$ (which
effectively means there are more wedges than degree 1 vertices). We can bound
$\binom{d_i}{2} \ge d_i^2/4$ for $d_i > 1$, and so $T = \Omega\left(\sum_i d_i^2\right)$. A key
fact we use is the Kruskal-Katona theorem [38-40], which states that if a graph has $T$ triangles
and $s$ edges, then $T \le s^{3/2}$. Combining, we have

$$\sum_i d_i^2 \ll s^{3/2}. \tag{5}$$

Now, let us count the expected number of triangles based on the CL distribution. For any triple
$(i,j,k)$, let $X_{ijk}$ be the indicator random variable for $(i,j,k)$ being a triangle. This
occurs when all the edges $(i,j)$, $(j,k)$, and $(k,i)$ are present, and by independence, this
probability is $p_{ij}p_{jk}p_{ki}$, where $p_{ij} = d_i d_j/2s$. The expected number of triangles
$T$ can be expressed as $\mathbf{E}\left[\sum_{i<j<k} X_{ijk}\right]$, which (by linearity of
expectation) is $\sum_{i<j<k} \mathbf{E}[X_{ijk}]$. Therefore,

$$T = \sum_{i<j<k} \frac{d_i d_j}{2s} \cdot \frac{d_j d_k}{2s} \cdot \frac{d_k d_i}{2s} \ll \frac{\left(\sum_i d_i^2\right)^3}{s^3}. \tag{6}$$

We argued earlier that $T = \Omega\left(\sum_i d_i^2\right)$. We can put this bound in (6) and
rearrange to get

$$s^{3/2} \ll \sum_i d_i^2. \tag{7}$$

This is the exact reverse of (5)! This means that these quantities are the same up to constant
factors. When can this be satisfied? If the community consists of $\sqrt{s}$ vertices all with
degree $\sqrt{s}$, then $\sum_i d_i^2 = \sqrt{s} \cdot s = s^{3/2}$, and the conditions are exactly
satisfied. Intuitively, to satisfy both (5) and (7), there have to be $\Theta(\sqrt{s})$ vertices
of degree $\Theta(\sqrt{s})$. These vertices form a dense ER graph within the community, proving
that each community involves a constant fraction of the edges in an ER graph.
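
As a quick numerical sanity check of this balance, one can sample a dense ER graph and compare $\sum_i d_i^2$ with $s^{3/2}$. The Python sketch below is an illustration only; the function name `degree_square_ratio` is hypothetical. For a dense ER graph with constant $p$, a back-of-the-envelope calculation suggests the ratio should stay near $2^{3/2}\sqrt{p}$ regardless of $n$, which is about 2 for $p = 0.5$.

```python
import random


def degree_square_ratio(n, p, seed=0):
    """Sample an ER graph on n vertices with edge probability p and
    return (sum of squared degrees) / s**1.5, where s is the edge count."""
    rng = random.Random(seed)
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    s = sum(deg) / 2.0
    return sum(d * d for d in deg) / s ** 1.5
```

Calling `degree_square_ratio(50, 0.5)` and `degree_square_ratio(200, 0.5)` should give values of the same order, illustrating that both quantities grow together up to a constant factor.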

THE BTER MODEL IN DETAIL 

The BTER model comprises an interconnected scale-free collection of communities. Intuitively,
short-range connections (Phase 1) tend to be dense and lead to large clustering coefficients.
Long-range connections (Phase 2) are sparse and lead to heavy-tailed degree distributions. We
describe the steps in detail below.

FIG. 1: BTER Model Construction. (a) Preprocessing: distribution of nodes into communities.
(b) Phase 1: local links within each community. (c) Phase 2: global links across communities.
In the preprocessing phase, the nodes are divided into communities. In Phase 1, within-community
links are generated using the ER model. In Phase 2, across-community links are generated using the
CL model on the excess degrees.

Preprocessing In the preprocessing step, each node of degree 2 or higher is assigned to a
community. We assume the desired degree distribution $\{d_i\}$ is given, where $d_i$ denotes the
desired degree of node $i$. Roughly speaking, $d$ vertices of degree close to $d$ are assigned to a
community (though in reality, it is somewhat more involved than that because of high degree
vertices). The vertices are all partitioned into these communities, which exhibit scale-free
behavior. Since the degree distribution is an input to the model, this step is relatively
straightforward and results in a structure as shown in Fig. 1a. We let $G_k$ denote the $k$th
community and $k_i$ denote the community assignment for node $i$.

Phase 1 The local community structure is modeled as an ER graph on each community. This is
illustrated in Fig. 1b. The connectivity of each community is a parameter of the model. By
observing the clustering coefficient plots for real graphs, we can see that low degree vertices
have a much higher clustering coefficient than higher degree ones. This suggests that small
communities are much more tightly connected than larger ones, and so we adjust the connectivity
accordingly. Any formula may be used; we have found empirically that the following works well in
practice. We let the edge probability for community $k$ be defined as

$$\rho_k = \rho \left[ 1 - \eta \left( \frac{\log(\bar{d}_k + 1)}{\log(d_{\max} + 1)} \right)^{2} \right], \tag{8}$$

where $\bar{d}_k = \min\{ d_i \mid i \in G_k \}$, $d_{\max}$ is the maximum degree of any node in
the entire graph, and $\rho$ and $\eta$ are parameters that can be selected for the best fit to a
particular graph. (These were selected by manual experimentation for our results, but more
elaborate procedures could certainly be developed.)

Phase 2 The global structure is determined by interconnecting the communities. We apply a CL model
to the excess degree, $e_i$, of each node, which is computed as follows:

$$e_i = \begin{cases} 1, & \text{if } d_i = 1, \\ d_i - \rho_{k_i}\left(|G_{k_i}| - 1\right), & \text{otherwise,} \end{cases} \tag{9}$$

where $|G_k|$ is the size of community $k$. Given the $e_i$'s for all nodes, edges are generated by
choosing two endpoints at random. Specifically, the probability of selecting node $i$ is
$e_i / \sum_j e_j$. It is possible to produce duplicate links or self-links, but these are
discarded. Phase 2 is illustrated in Fig. 1c.
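
The three steps above can be sketched in a few dozen lines of Python. This is a simplified illustration under stated assumptions, not the MATLAB reference implementation: the function names are hypothetical, blocks are cut greedily with $d$ vertices where $d$ is the least degree in the block, the edge probability is assumed to follow the form of (8) with the power on the log ratio exposed as the `exponent` argument, the raw probability and the excess degree are clamped to valid ranges, and none of the special handling of degree-1 vertices is included.

```python
import math
import random


def block_edge_probability(d_min, d_max, rho, eta, exponent=2):
    """Edge probability for a block whose smallest target degree is d_min."""
    ratio = math.log(d_min + 1) / math.log(d_max + 1)
    # Clamp to [0, 1] because a large eta can push the raw value below zero.
    return min(1.0, max(0.0, rho * (1.0 - eta * ratio ** exponent)))


def bter_sketch(degrees, rho=0.9, eta=0.5, seed=0):
    """Return (edges, communities) for a list of target degrees (max >= 1)."""
    rng = random.Random(seed)
    n, d_max = len(degrees), max(degrees)

    # Preprocessing: walk through vertices of degree >= 2 in increasing degree
    # order and cut off blocks of d vertices, d being the block's least degree.
    order = sorted((v for v in range(n) if degrees[v] >= 2),
                   key=lambda v: degrees[v])
    communities, pos = [], 0
    while pos < len(order):
        size = degrees[order[pos]]
        communities.append(order[pos:pos + size])
        pos += size

    edges = set()
    # Degree-1 vertices get excess degree 1; the others are filled in below.
    excess = [1.0 if d == 1 else 0.0 for d in degrees]

    # Phase 1: an ER graph inside every block.
    for block in communities:
        p_k = block_edge_probability(degrees[block[0]], d_max, rho, eta)
        for a in range(len(block)):
            for b in range(a + 1, len(block)):
                if rng.random() < p_k:
                    u, v = block[a], block[b]
                    edges.add((min(u, v), max(u, v)))
        for v in block:
            # Expected excess degree in the spirit of (9), floored at zero.
            excess[v] = max(0.0, degrees[v] - p_k * (len(block) - 1))

    # Phase 2: CL-style sampling of endpoints proportional to excess degree.
    total = sum(excess)
    if total > 0:
        for _ in range(round(total / 2)):
            u, v = rng.choices(range(n), weights=excess, k=2)
            if u != v:  # self-links are discarded; the set drops duplicates
                edges.add((min(u, v), max(u, v)))
    return edges, communities
```

A call such as `bter_sketch([1] * 30 + [2] * 20 + [3] * 15 + [5] * 10 + [10] * 5)` returns the edge set together with the block assignment, so the clustering inside blocks can be inspected separately from the Phase 2 links.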

Reference implementation A MATLAB reference implementation of BTER is included in the
supplementary material, including scripts to reproduce the findings in this paper. In this
implementation, we have taken some care to reduce the variance in the CL model with respect to
degree-one nodes. We also generate extra edges in Phase 2 to account for expected repeats and
self-loops that are removed. These details are described in the supplementary materials.



RESULTS 

We consider comparisons of the BTER model with four real-world data sets from the SNAP collection
[41]. All the graphs are treated as undirected. Properties of these data sets are shown in Tab. I.
We compare BTER with the real data as well as the corresponding CL model. Fig. 2 shows results on a
collaboration network on 124 months of data from the astrophysics (ASTRO-PH) section of the arXiv
preprint server. Here, the edge probabilities in the communities are given by (8) with
$\rho = 0.95$ and $\eta = 0.05$. Fig. 3 shows results on a who-trusts-whom online social (review)
network from the Epinions website. Here, the edge probabilities in the communities are given by (8)
with $\rho = 0.70$ and $\eta = 1.25$. Comparisons on two additional datasets listed in the table
are provided in the supplement.
FIG. 2: Properties of ca-AstroPh, a co-authorship network from astrophysics papers, compared with
the BTER and CL models. Observe the close match of the clustering coefficients of the real data and
BTER, in contrast to CL. Additionally, the eigenvalues of the BTER adjacency matrix are close to
those of the real data.

FIG. 3: Properties of soc-Epinions1, a social network from the Epinions website, compared with the
BTER and CL models. In this case, the clustering coefficients are much smaller overall, but the
BTER model is still a closer match to the real data than CL in terms of both the clustering
coefficient and the eigenvalues of the adjacency matrix.

In the leftmost plots of Fig. 2 and Fig. 3, we see the comparison of the degree distributions. The
degree distribution for ca-AstroPh has a slight "kink" mid-way and does not conform to any standard
degree distribution such as lognormal or power law. Nonetheless, both BTER and CL are able to match
it. The degree distribution for soc-Epinions1 is fairly close to a power law, and matched well by
both BTER and CL. In fact, these models can match any degree distribution.

The difference between BTER and CL is highlighted 
when we instead consider the clustering coefficient, 
shown in the center plots of Fig. 2 and Fig. 3. As noted 
previously, CL cannot have a high clustering coefficient 
and a heavy tail, and this is evident in these examples. 
BTER, on the other hand, has a close match with the 
observed clustering coefficients. The dense ER graphs 
ensure that all nodes have high clustering coefficient. 

The importance of matching the clustering coefficients becomes apparent when considering other
features of the graph such as the eigenvalues of the adjacency matrix, as shown in the rightmost
plots of Fig. 2 and Fig. 3. For ca-AstroPh, the BTER eigenvalues are a much closer match than the
CL eigenvalues because the community behavior is significant ($C = 0.32$). For soc-Epinions1, the
difference between the models in terms of the eigenvalues is less dramatic because the community
behavior is much less evident ($C = 0.07$); nonetheless, BTER is still a closer match.



DISCUSSION 

We define a community to be a subgraph that is inter- 
nally well-modeled by CL (and thus has no further sub- 
structure) and highly interconnected (so that it has many 
triangles). We prove that any community must contain 
a dense ER subgraph. Therefore, any graph model that 
captures community structure must contain dense sub- 
structures in the form of dense ER graphs. This observa- 
tion leads naturally to the BTER model, which explicitly 
builds communities of varying sizes and simultaneously 
generates a heavy tail. 

Fitting the BTER model to real-world data is straightforward. The community sizes and composition
in BTER are determined automatically according to the degree distribution. We currently use a
simplistic procedure that assumes that all nodes in the same community have the same expected
degree. Undoubtedly, this is an unrealistic assumption, but the variance of the model ensures that
the degrees within a community vary considerably and Phase 2 adds connections between nodes of
widely varying degrees. The connectivity of each ER block is a user-tunable parameter that can be
adjusted to fit observed data. We currently prescribe a simple formula (8) and fit by trial and
error, but the procedure could certainly be automated. Moreover, there is no particular requirement
that $\rho_k$ be exactly the same for all communities with the same $\bar{d}_k$ (minimum degree)
nor that $\rho_k$ be computed by a deterministic formula.

Our experimental results show that BTER has properties
that are remarkably similar to real-world data sets.
We contend that this makes BTER an appropriate model 
to use for testing algorithms and architectures designed 
for interaction graphs. In fact, BTER is even designed to 
be scalable. In particular, in Phase 2 we could compute 
the exact excess degree and use a matching procedure 
to complete the graph. The advantage of computing the 
excess degree in expectation is that it is more easily paral- 



lelized. In that case, the assignment to communities, the 
community connectivity, and the expected excess degree 
can all be computed in the preprocessing stage. Both 
Phase 1 and Phase 2 edges can be efficiently generated 
in parallel via a randomized procedure. Therefore, the 
BTER model is suitable for massive-scale modeling, such 
as that needed by Graph 500 [25]. The details of this 
implementation are outside the scope of the current dis- 
cussion but will be considered in future work. 

Our formalism captures the more advanced notion of 
link communities [42] (where edges, rather than vertices, 
form communities). This allows vertices to participate 
in many communities. The notion of communities uses 
modules over internal degrees, so one can easily imagine
a vertex in many communities. Thm. 1 is still true, and 
we still get a scale-free collection of ER graphs which 
may share vertices. We believe it is an interesting di- 
rection to extend BTER to link (and hence overlapping) 
communities. 
TABLE I: Data sets for empirical validation

Graph                Vertices    Edges      C
ca-AstroPh [21]      18,772      396,100    0.32
soc-Epinions1 [43]   75,879      811,480    0.07
cit-HepPh [44]       34,546      841,754    0.15
ca-CondMat [21]      23,133      186,878    0.26
Supporting Material

Theoretical details

The aim of this section is to prove Thm. 1, which we restate (in slightly different wording) for
convenience. We set $\binom{d}{2} = 0$ when $d = 1$. We will use small Greek letters for constants
less than 1, and small Roman letters for constants whose values may exceed 1. All constants
considered are positive. We will make no attempt to optimize various constant factors in the proof.
The proof is asymptotic in $s$, the number of edges of our community. That means that the proof
holds for any sufficiently large $s$.

Theorem 2. Consider a CL graph with degree sequence $d_1 \le d_2 \le \cdots \le d_r$ and set
$s = \sum_i d_i / 2$. The quantities $c$ and $\kappa \in (0,1)$ are constants (independent of $s$).
Let $j$ be the smallest index such that $d_j > 1$, and assume that $\sum_{i \ge j} d_i^2 \ge cj$.
Suppose the expected number of triangles generated with this degree sequence is at least
$(\kappa/3)\sum_i \binom{d_i}{2}$. Then (for sufficiently large $s$), there exists a set of indices
$S \subseteq \{1, \ldots, r\}$ such that $|S| = \Omega(\sqrt{s})$ and
$\forall k \in S$, $d_k = \Omega(\sqrt{s})$.

(The constants hidden in the $\Omega(\cdot)$ notation only hide a dependence on $c$ and $\kappa$.)

The proof of this theorem requires some extremal com- 
binatorics and probability theory. We will first state 
some of these building blocks before describing the main 
proof. Henceforth, the assumptions stated in the theorem 
hold. An important tool is the Kruskal-Katona theorem 
that gives an upper bound on the number of triangles in 
a graph with a fixed number of edges. 

Theorem 3 (Kruskal-Katona [38-40]). If a graph has $t$ triangles and $m$ edges, then
$t \le m^{3/2}$.
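
The bound can be checked numerically on complete graphs, which pack the most triangles into a given number of vertices. The Python sketch below is only an illustration; the function name `clique_counts` is hypothetical. For the complete graph on $n$ vertices, $t = \binom{n}{3}$ and $m = \binom{n}{2}$, and the ratio $t / m^{3/2}$ approaches $\sqrt{2}/3 \approx 0.47$ as $n$ grows, so the inequality holds with room to spare.

```python
from math import comb


def clique_counts(n):
    """Return (t, m): the number of triangles and edges in the complete
    graph on n vertices."""
    return comb(n, 3), comb(n, 2)


def triangle_edge_ratio(n):
    """Return t / m**1.5 for the complete graph on n >= 2 vertices."""
    t, m = clique_counts(n)
    return t / m ** 1.5
```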

Since we are dealing with a graph distribution, we need 
some bounds on the expected number of edges. 

Claim 4. Let $E$ be the random variable denoting the number of edges in the CL graph defined by
$\{d_i\}$. Then $\mathbf{E}[E^{3/2}] \le 3s^{3/2}$.

Proof. Let $X_{ij}$ be the indicator random variable for the edge $(i,j)$ being present. Then,
$E = \sum_{i<j} X_{ij}$. By construction, $\mathbf{E}[E] \le s$ (we get an inequality because of
possible self-loops). This is the sum of $\binom{r}{2}$ independent random variables. Applying a
multiplicative Chernoff bound (Theorems 4.1 and 4.3 of [45]),

$$\Pr[E > 2\mathbf{E}[E]] < \exp(-\mathbf{E}[E]/3).$$

Hence, the probability that $E > 2s$ is at most $\exp(-s/3)$. Let $\mathcal{E}$ denote the event
that $E > 2s$. We can trivially bound $E$ by $r^2 \le s^2$. Using Bayes' rule,

$$\mathbf{E}[E^{3/2}] = \Pr(\bar{\mathcal{E}})\,\mathbf{E}_{\bar{\mathcal{E}}}[E^{3/2}] + \Pr(\mathcal{E})\,\mathbf{E}_{\mathcal{E}}[E^{3/2}] \le (2s)^{3/2} + \exp(-s/3)(s^2)^{3/2} \le 3s^{3/2}.$$

□

We now prove some claims about the expected number of triangles and the degree sequence.

Claim 5. Let $T$ denote the expected number of triangles. The constants $\beta$ and $c'$ depend
only on $c$ and $\kappa$.

1. $T \ge \beta \sum_i d_i^2$.

2. $\sum_i d_i^2 \le c' s^{3/2}$.

Proof. By assumptions in Thm. 2, $T \ge (\kappa/3)\sum_i \binom{d_i}{2}$. For $d_i > 1$,
$\binom{d_i}{2} \ge d_i^2/4$ (for large $d_i$, it is actually much closer to $d_i^2/2$). Hence,
$T \ge (\kappa/12)\sum_{i \ge j} d_i^2$, where $j$ is the smallest index of a non-degree 1 vertex
(as stated in Thm. 2). By assumption, $\sum_{i<j} d_i^2 < j \le (1/c)\sum_{i \ge j} d_i^2$, and
$\sum_i d_i^2 \le (1 + 1/c)\sum_{i \ge j} d_i^2$. We can complete the proof of the first part with
the following.

$$T \ge (\kappa/12)\sum_{i \ge j} d_i^2 \ge (\kappa/12)(1 + 1/c)^{-1}\sum_i d_i^2.$$

Suppose we generate a random CL graph. Let $t$ be the number of triangles and $E$ be the number of
edges (both random variables). By Thm. 3, $t \le E^{3/2}$. Taking expectations and applying
Claim 4, $T \le \mathbf{E}[E^{3/2}] \le 3s^{3/2}$. Combining with the first part of this claim,
$\sum_i d_i^2 \le (3/\beta)s^{3/2}$. □

We come to the proof of the main theorem. 

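Proof (of Thm. 2). We choose $b$ to be a sufficiently large constant, and $\gamma$ to be
sufficiently small. Let $\ell$ be the smallest index such that $d_\ell > b\sqrt{s}$. For a triple
of vertices $(i,j,k)$, let $X_{ijk}$ be the indicator random variable for $(i,j,k)$ forming a
triangle. Note that $T = \mathbf{E}\left[\sum_{i<j<k} X_{ijk}\right]$. Then we have,

$$
\begin{aligned}
T &= \sum_{i<j<k} \mathbf{E}[X_{ijk}] \\
  &\le \sum_{i<j<k} \min\left(\frac{d_i d_j}{2s}, 1\right) \times \min\left(\frac{d_i d_k}{2s}, 1\right) \times \min\left(\frac{d_j d_k}{2s}, 1\right) \\
  &\le \sum_{i} \sum_{j<k<\ell} \frac{d_i d_j}{2s} \cdot \frac{d_i d_k}{2s} \cdot \frac{d_j d_k}{2s} + \sum_{i} \sum_{j,k \ge \ell} \frac{d_i d_j}{2s} \cdot \frac{d_i d_k}{2s} \\
  &\le \frac{\left(\sum_i d_i^2\right)\left(\sum_{j<\ell} d_j^2\right)^2}{8s^3} + \frac{\left(\sum_i d_i^2\right)\left(\sum_{j \ge \ell} d_j\right)^2}{4s^2}.
\end{aligned}
$$

The third line uses the fact that, among any three indices, two of them (call them $j$ and $k$) are
either both below $\ell$ or both at least $\ell$; in the latter case we bound the probability of the
edge $(j,k)$ by 1.

By the first part of Claim 5, $T \ge \beta\sum_i d_i^2$. For convenience, we will replace all the
independent indices above by $i$.

$$\beta \sum_i d_i^2 \le \frac{\left(\sum_i d_i^2\right)\left(\sum_{i<\ell} d_i^2\right)^2}{8s^3} + \frac{\left(\sum_i d_i^2\right)\left(\sum_{i \ge \ell} d_i\right)^2}{4s^2}. \tag{10}$$

By Claim 5, $\sum_i d_i^2 \le c' s^{3/2}$. Furthermore,
$\sum_i d_i^2 \ge b\sqrt{s}\sum_{i \ge \ell} d_i$. Combining the two bounds,
$\sum_{i \ge \ell} d_i \le (c'/b)s$. Applying these bounds in (10), dividing through by
$\sum_i d_i^2$, and setting constant $\tau$ appropriately,

$$\beta \le \frac{c'}{8s^{3/2}} \sum_{i<\ell} d_i^2 + \left(\frac{c'}{2b}\right)^2 \implies \sum_{i<\ell} d_i^2 \ge (8/c')\left(\beta - (c'/2b)^2\right)s^{3/2} = \tau s^{3/2}.$$

(By setting $b$ to be sufficiently large, we can ensure that $\tau$ is a positive constant.) Let
$\ell'$ be the smallest index such that $d_{\ell'} > \gamma\sqrt{s}$ and set
$s' = \sum_{\ell' \le i < \ell} d_i$. Since the degrees sum to $2s$, the vertices below $\ell'$
contribute at most $2s - s'$ to the degree sum, so

$$
\begin{aligned}
\tau s^{3/2} \le \sum_{i<\ell} d_i^2 &\le \gamma\sqrt{s}\,(2s - s') + b\sqrt{s}\,s' \\
\implies \tau s &\le s'(b - \gamma) + 2\gamma s \\
\implies s' &\ge s(\tau - 2\gamma)/(b - \gamma) = \Omega(s).
\end{aligned}
$$

(Again, a sufficiently small $\gamma$ ensures positivity.) This means that the vertices with
indices in $[\ell', \ell)$ are totally incident to at least $\Omega(s)$ edges. All these vertices
have degrees that are $\Theta(\sqrt{s})$, and hence there are $\Omega(\sqrt{s})$ such vertices. □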



Implementation details

We give some specifics of the BTER implementation, to accompany the included MATLAB implementation.

In Phase 1, the last "community" generally has fewer than $\bar{d}_k$ nodes because we have run out
of nodes. We have found it convenient to set $\rho_k$ to a constant for the last community since it
is generally pretty small in any case.

We split the calculation of the Phase 2 edges into three subphases so that we can specially handle
the degree-1 edges. The variance for degree-1 vertices in the CL model is high, so we set aside a
proportion of these vertices to be handled "manually." Let $r$ denote the number of degree-1
vertices and assume the vertices are indexed from least degree to greatest. By default, 75% of the
degree-1 vertices are handled "manually" (the exact proportion is user-definable); let
$p = [0.75r]$ denote this quantity, where $[\cdot]$ denotes nearest integer. We update $e_i$ as
follows:

$$e_i \leftarrow \begin{cases} 0, & \text{for } 1 \le i \le p, \\ 1.10, & \text{for } p+1 \le i \le r, \\ e_i, & \text{otherwise.} \end{cases}$$

This update removes the first $p$ nodes from the CL part and also slightly raises the probability
of an edge for the remaining $r - p$ degree-1 nodes. This modification helps to balance out the
fact that some higher degree nodes (in expectation) actually become degree-1 nodes in the final
graph, so we need some of the degree-1 nodes (in expectation) to become higher degree in the final
graph.

In Phase 2a, we set aside $q \le p$ ($q$ even) degree-1 vertices to be connected to other degree-1
vertices. This value can be specified by the user or defaults to a quantity computed from $p$ that
corresponds to the number of degree-1-to-degree-1 edges expected in the CL model. This can be
accomplished by randomly pairing the selected vertices. In all of our experiments, we use $q = 0$.

In Phase 2b, we manually connect the remaining $(p - q)$ vertices to the rest of the graph. For
each degree-1 vertex, we select an endpoint proportional to $e_i$.

In Phase 2c, we finally create the CL model. We modify the expected degrees to account for the
edges used in Phase 2b and to account for duplicates. Thus, we update $e_i \leftarrow \eta e_i$,
where $\eta$ is a scaling factor that depends on $p$ and on $\beta$, the proportion of duplicates.
We use $\beta = 0.10$ in our experiments. The total number of edges generated in Phase 2c
(including repeats and self-edges, which are discarded) is $\lfloor \sum_i e_i / 2 \rfloor$.

Additional experimental results



We consider two additional experiments, as shown in Fig. 4 and Fig. 5. The graph in Fig. 4
represents all the citations within a dataset on the high energy physics phenomenology section of
the arXiv preprint server. For Fig. 4, we use an alternate formula for $\rho_k$ as follows:

$$\rho_k = 0.7\left[1 - 0.6\left(\frac{\log(\bar{d}_k + 1)}{\log(d_{\max} + 1)}\right)^{3}\right].$$

Fig. 5 shows results on a collaboration network on 124 months of data from the condensed matter
(COND-MAT) section of the arXiv server. Here, the edge probabilities in the communities are given
by (8) with $\rho = 0.95$ and $\eta = 0.95$.

The results on these two graphs are consistent with our earlier results. The leftmost plots show
that both CL and BTER can match the degree distributions of the original graph, as expected. Again,
the clustering coefficient plots in the middle highlight the strength of BTER, and how it differs
from CL: BTER matches the clustering coefficients closely, while CL does not produce any
significant number of triangles. The rightmost column shows that the eigenvalues of the adjacency
matrices of BTER are closer to those of the original graph than those produced by CL.

FIG. 4: Properties of cit-HepPh, a citation network of High Energy Physics papers, compared with
the BTER and CL models.

FIG. 5: Properties of ca-CondMat, a co-authorship network of Condensed Matter physics, compared
with the BTER and CL models.



Codes and data 



The codes and data are available at
http://www.sandia.gov/~tgkolda/bter_supplement/.